3.1 - Goal - Learning a Single Economic Task

Our first foray into reinforcement learning will be a foundational "Hello, World!" task. The objective is to create the simplest possible environment where an agent can learn a single, vital economic behavior: producing worker units.

By isolating this one task, we can verify that our entire RL architecture—from the SC2GymEnv wrapper to the training script—is functioning correctly before adding further complexity.

Problem Definition

We will train an agent whose sole purpose is to learn the correct policy for building SCVs from a Command Center.

  • Success Criteria:
    • The agent must learn to check for sufficient resources (minerals >= 50).
    • The agent must learn to check for an available production facility (idle Command Center).
    • The agent must learn to correlate the correct game state with the "build worker" action.
    • The episode must terminate successfully upon reaching a target of 20 workers.
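Below is a minimal sketch of these checks, assuming python-sc2-style bot attributes (self.minerals, self.workers, self.townhalls); the helper names and the idle-townhall check are illustrative, not the final implementation.

```python
# Illustrative sketch only, assuming python-sc2-style bot attributes
# (self.minerals, self.workers, self.townhalls); helper names are hypothetical.
TARGET_WORKERS = 20
SCV_MINERAL_COST = 50

def can_build_scv(bot) -> bool:
    # Sufficient resources AND an available (idle) production facility.
    return bot.minerals >= SCV_MINERAL_COST and bot.townhalls.idle.exists

def episode_complete(bot) -> bool:
    # The episode terminates successfully once the worker target is reached.
    return bot.workers.amount >= TARGET_WORKERS
```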

Environment Design

To facilitate this learning, we will design a minimal environment with precisely defined observation, action, and reward structures.

1. Observation Space (The Agent's Input)

The state provided to the agent must contain only the information necessary to make the "build worker" decision. All values are normalized to a [0.0, 1.0] range to improve neural network stability.

| Index | Feature | Normalization Factor | Purpose |
| --- | --- | --- | --- |
| 0 | self.minerals | 1000 | "Can I afford the action?" |
| 1 | self.workers.amount | 50 | "How close am I to the goal?" |
| 2 | self.supply_left | 20 | "Do I have the capacity for a new unit?" |
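A sketch of how this observation vector might be assembled with Gymnasium, dividing each raw value by its normalization factor from the table; the build_observation helper is hypothetical.

```python
import numpy as np
from gymnasium import spaces

# A 3-element vector, every feature scaled into [0.0, 1.0].
observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)

def build_observation(bot) -> np.ndarray:
    # Divide each raw value by its normalization factor, then clip to [0, 1].
    obs = np.array(
        [
            bot.minerals / 1000.0,       # "Can I afford the action?"
            bot.workers.amount / 50.0,   # "How close am I to the goal?"
            bot.supply_left / 20.0,      # "Do I have capacity for a new unit?"
        ],
        dtype=np.float32,
    )
    return np.clip(obs, 0.0, 1.0)
```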

2. Action Space (The Agent's Output)

The agent has a discrete, binary choice on each decision step.

| Action Value | Agent's Intent |
| --- | --- |
| 0 | Do Nothing |
| 1 | Attempt to build an SCV |
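One possible way to translate that discrete choice into game commands, assuming python-sc2's UnitTypeId and Unit.train (older library versions may require the command to be wrapped in `await bot.do(...)`); the apply_action helper and its return values are assumptions used to feed the reward function below.

```python
from gymnasium import spaces
from sc2.ids.unit_typeid import UnitTypeId

action_space = spaces.Discrete(2)  # 0 = do nothing, 1 = attempt to build an SCV

def apply_action(bot, action: int) -> tuple[bool, bool]:
    """Return (build_attempted, build_was_possible) for the reward function."""
    if action == 0:
        return False, False  # Do nothing this decision step.
    if bot.minerals >= 50 and bot.townhalls.idle.exists and bot.supply_left > 0:
        # Queue an SCV at an idle Command Center.
        bot.townhalls.idle.first.train(UnitTypeId.SCV)
        return True, True
    return True, False  # Build was requested but is impossible right now.
```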

3. Reward Function (The Learning Signal)

The reward function is engineered to provide clear and immediate feedback, guiding the agent toward the correct behavior.

| Triggering Condition | Reward Value | Signal Type | Purpose |
| --- | --- | --- | --- |
| "Build" action is taken and is possible | +5 | Sparse | Strongly reinforce the correct decision. |
| "Build" action is taken but is impossible | -5 | Sparse | Strongly penalize impossible actions (teaches game rules). |
| Every decision step | + (worker_count * 0.1) | Dense | Continuously encourage progress toward the goal. |
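Put together, the reward could be computed roughly as sketched below; build_attempted and build_was_possible are the hypothetical flags returned by the action handler above.

```python
def compute_reward(worker_count: int, build_attempted: bool, build_was_possible: bool) -> float:
    reward = 0.0
    if build_attempted:
        # Sparse signal: reinforce legal builds, penalize impossible ones.
        reward += 5.0 if build_was_possible else -5.0
    # Dense signal: continuous encouragement proportional to progress.
    reward += worker_count * 0.1
    return reward
```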

Key Learning Challenges for the Agent

  • Credit Assignment: The agent must learn that the +5 reward is directly caused by its "Build" action when resources are sufficient.
  • Constraint Learning: The agent must learn to avoid the -5 penalty by paying attention to the minerals and supply_left observations, effectively learning the game's constraints.
  • Temporal Progression: The dense reward tied to worker_count encourages the agent not merely to "do nothing" but to actively pursue the goal over time.

This tightly scoped problem provides a perfect, verifiable test case for our RL framework. In the next section, we will translate this design into code.